Building a song recommender

Fire up GraphLab Create


In [19]:
import graphlab

Load music data


In [20]:
song_data = graphlab.SFrame('song_data.gl/')

Explore data

The music data records how many times each user listened to each song, along with details of the song (its ID, title, and artist).


In [21]:
song_data.head()


Out[21]:
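+----------------------------------------------+--------------------+--------------+-----------------------------------+-------------------------------------------------+
| user_id                                      | song_id            | listen_count | title                             | artist                                          |
+----------------------------------------------+--------------------+--------------+-----------------------------------+-------------------------------------------------+
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOAKIMP12A8C130995 | 1            | The Cove                          | Jack Johnson                                    |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBBMDR12A8C13253B | 2            | Entre Dos Aguas                   | Paco De Lucia                                   |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBXHDL12A81C204C0 | 1            | Stronger                          | Kanye West                                      |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBYHAJ12A6701BF1D | 1            | Constellations                    | Jack Johnson                                    |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODACBL12A8C13C273 | 1            | Learn To Fly                      | Foo Fighters                                    |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODDNQT12A6D4F5F7E | 5            | Apuesta Por El Rock 'N' Roll ...  | Héroes del Silencio                             |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODXRTY12AB0180F3B | 1            | Paper Gangsta                     | Lady GaGa                                       |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOFGUAY12AB017B0A8 | 1            | Stacked Actors                    | Foo Fighters                                    |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOFRQTD12A81C233C0 | 1            | Sehr kosmisch                     | Harmonia                                        |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOHQWYZ12A6D4FA701 | 1            | Heaven's gonna burn your eyes ... | Thievery Corporation feat. Emiliana Torrini ... |
+----------------------------------------------+--------------------+--------------+-----------------------------------+-------------------------------------------------+
+-----------------------------------------------+
| song                                          |
+-----------------------------------------------+
| The Cove - Jack Johnson                       |
| Entre Dos Aguas - Paco De Lucia ...           |
| Stronger - Kanye West                         |
| Constellations - Jack Johnson ...             |
| Learn To Fly - Foo Fighters ...               |
| Apuesta Por El Rock 'N' Roll - Héroes del ... |
| Paper Gangsta - Lady GaGa                     |
| Stacked Actors - Foo Fighters ...             |
| Sehr kosmisch - Harmonia                      |
| Heaven's gonna burn your eyes - Thievery ...  |
+-----------------------------------------------+
[10 rows x 6 columns]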


In [22]:
graphlab.canvas.set_target('ipynb')

In [23]:
song_data['song'].show()



In [24]:
len(song_data)


Out[24]:
1116609

Count number of unique users in the dataset


In [25]:
users = song_data['user_id'].unique()

In [26]:
len(users)


Out[26]:
66346

Create a song recommender


In [27]:
train_data, test_data = song_data.random_split(0.8, seed=0)
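
The 80/20 split can be pictured as an independent coin flip per row. The following is a minimal plain-Python sketch under that assumption; `random_split_rows` is a hypothetical helper and is not GraphLab's implementation.

```python
import random

def random_split_rows(rows, fraction, seed=0):
    """Send each row to the first list with probability `fraction`, else to the second."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of song_data
train_rows, test_rows = random_split_rows(rows, 0.8, seed=0)
```

Under this assumption the training share is only approximately 80%, which is consistent with the training log reporting 893580 of the 1116609 observations.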

Simple popularity-based recommender


In [28]:
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                         user_id='user_id',
                                                         item_id='song')


Recsys training: model = popularity
Warning: Ignoring columns song_id, listen_count, title, artist;
    To use one of these as a target column, set target = 
    and use a method that allows the use of a target.
Preparing data set.
    Data has 893580 observations with 66085 users and 9952 items.
    Data prepared in: 0.753416s
893580 observations to process; with 9952 unique items.

Use the popularity model to make some predictions

A popularity model makes the same predictions for every user, so it provides no personalization.


In [29]:
popularity_model.recommend(users=[users[0]])


Out[29]:
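+----------------------------------------------+-----------------------------------------------------+--------+------+
| user_id                                      | song                                                | score  | rank |
+----------------------------------------------+-----------------------------------------------------+--------+------+
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Undo - Björk                                        | 4227.0 | 2    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Revelry - Kings Of Leon                             | 3527.0 | 5    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Secrets - OneRepublic                               | 3148.0 | 7    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Hey_ Soul Sister - Train                            | 2538.0 | 8    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 9    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Tive Sim - Cartola                                  | 2521.0 | 10   |
+----------------------------------------------+-----------------------------------------------------+--------+------+
[10 rows x 4 columns]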


In [30]:
popularity_model.recommend(users=[users[1]])


Out[30]:
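+----------------------------------------------+-----------------------------------------------------+--------+------+
| user_id                                      | song                                                | score  | rank |
+----------------------------------------------+-----------------------------------------------------+--------+------+
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Undo - Björk                                        | 4227.0 | 2    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Revelry - Kings Of Leon                             | 3527.0 | 5    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Secrets - OneRepublic                               | 3148.0 | 7    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Hey_ Soul Sister - Train                            | 2538.0 | 8    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 9    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Tive Sim - Cartola                                  | 2521.0 | 10   |
+----------------------------------------------+-----------------------------------------------------+--------+------+
[10 rows x 4 columns]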
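
The two identical result sets above follow from how a popularity model ranks items. The following is a minimal sketch, assuming each song is scored by the number of (user, song) observations in the training data and that songs the user already has are skipped. `popularity_recommend` and the toy data are hypothetical and are not GraphLab's implementation.

```python
from collections import Counter

def popularity_recommend(observations, user, k=3):
    """Rank songs by how many observations mention them, skipping songs `user` already has."""
    counts = Counter(song for _, song in observations)
    seen = {song for u, song in observations if u == user}
    # Sort by descending count, breaking ties alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(song, count) for song, count in ranked if song not in seen][:k]

# Toy (user_id, song) pairs; the values are illustrative only.
observations = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"),
    ("u4", "D"),
]
recs_u4 = popularity_recommend(observations, "u4")
recs_new = popularity_recommend(observations, "u5")
```

In this sketch the ranking itself never depends on the user; only the filter for already-heard songs does, so two users who have heard none of the top songs receive the same list.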

Build a song recommender with personalization

We now create a model that allows us to make personalized recommendations to each user.


In [31]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id='user_id',
                                                                item_id='song')


Recsys training: model = item_similarity
Warning: Ignoring columns song_id, listen_count, title, artist;
    To use one of these as a target column, set target = 
    and use a method that allows the use of a target.
Preparing data set.
    Data has 893580 observations with 66085 users and 9952 items.
    Data prepared in: 0.752743s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 1.768ms                        | 4.5        |
| 25.109ms                       | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 196.201ms                           | 0                | 0               |
| 641.561ms                           | 100              | 9952            |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 0.709884s

Applying the personalized model to make song recommendations

As you can see, different users get different recommendations now.


In [54]:
personalized_model.recommend([users[0]])


Out[54]:
+----------------------------------------------+--------------------------------------------------+-----------------+------+
| user_id                                      | song                                             | score           | rank |
+----------------------------------------------+--------------------------------------------------+-----------------+------+
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Riot In Cell Block Number Nine - Dr Feelgood ... | 0.0374999940395 | 1    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Sei Lá Mangueira - Elizeth Cardoso ...           | 0.0331632643938 | 2    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | The Stallion - Ween                              | 0.0322580635548 | 3    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Rain - Subhumans                                 | 0.0314159244299 | 4    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | West One (Shine On Me) - The Ruts ...            | 0.0306771993637 | 5    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Back Against The Wall - Cage The Elephant ...    | 0.0301204770803 | 6    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Life Less Frightening - Rise Against ...         | 0.0284431129694 | 7    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | A Beggar On A Beach Of Gold - Mike And The ...   | 0.0230024904013 | 8    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Audience Of One - Rise Against ...               | 0.0193938463926 | 9    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Blame It On The Boogie - The Jacksons ...        | 0.0189873427153 | 10   |
+----------------------------------------------+--------------------------------------------------+-----------------+------+
[10 rows x 4 columns]


In [33]:
personalized_model.recommend(users=[users[1]])


Out[33]:
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
| user_id                                      | song                                                | score           | rank |
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Grind With Me (Explicit Version) - Pretty Ricky ... | 0.0459424376488 | 1    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | There Goes My Baby - Usher ...                      | 0.0331920742989 | 2    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Panty Droppa [Intro] (Album Version) - Trey ...     | 0.0318566203117 | 3    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Nobody (Featuring Athena Cage) (LP Version) - ...   | 0.0278467655182 | 4    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Youth Against Fascism - Sonic Youth ...             | 0.0262914180756 | 5    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Nice & Slow - Usher                                 | 0.0239639401436 | 6    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Making Love (Into The Night) - Usher ...            | 0.0238176941872 | 7    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Naked - Marques Houston                             | 0.0228925704956 | 8    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | I.nner Indulgence - DESTRUCTION ...                 | 0.0220767498016 | 9    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Love Lost (Album Version) - Trey Songz ...          | 0.0204497694969 | 10   |
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
[10 rows x 4 columns]
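
How an item-similarity model can produce different lists for different users is sketched below. The sketch assumes Jaccard similarity between the sets of users who listened to each song, and scores a candidate by its mean similarity to the songs the user already has. The source does not state which similarity measure or scoring rule the model uses, so both are assumptions, and `recommend_for` is a hypothetical helper.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

def recommend_for(observations, user, k=2):
    """Score unseen songs by their mean Jaccard similarity to the user's songs."""
    listeners = defaultdict(set)  # song -> set of users who listened to it
    for u, song in observations:
        listeners[song].add(u)
    mine = {song for song, users in listeners.items() if user in users}
    if not mine:
        return []  # no listening history, so nothing to compare against
    scores = {}
    for song, users in listeners.items():
        if song not in mine:
            scores[song] = sum(jaccard(users, listeners[m]) for m in mine) / len(mine)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

# Toy (user_id, song) pairs; the values are illustrative only.
observations = [
    ("u1", "rock1"), ("u1", "rock2"),
    ("u2", "rock1"), ("u2", "rock2"), ("u2", "rock3"),
    ("u3", "jazz1"), ("u3", "jazz2"),
    ("u4", "jazz1"),
]
rock_fan = recommend_for(observations, "u1")
jazz_fan = recommend_for(observations, "u4")
```

Here `u1` is steered toward `rock3` and `u4` toward `jazz2`, because each candidate is compared against that user's own listening history.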

We can also use the model to find songs similar to any song in the dataset.


In [34]:
personalized_model.get_similar_items(['With Or Without You - U2'])


Out[34]:
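+--------------------------+------------------------------------------------+-----------------+------+
| song                     | similar                                        | score           | rank |
+--------------------------+------------------------------------------------+-----------------+------+
| With Or Without You - U2 | I Still Haven't Found What I'm Looking For ... | 0.042857170105  | 1    |
| With Or Without You - U2 | Hold Me_ Thrill Me_ Kiss Me_ Kill Me - U2 ...  | 0.0337349176407 | 2    |
| With Or Without You - U2 | Window In The Skies - U2                       | 0.0328358411789 | 3    |
| With Or Without You - U2 | Vertigo - U2                                   | 0.0300751924515 | 4    |
| With Or Without You - U2 | Sunday Bloody Sunday - U2                      | 0.0271317958832 | 5    |
| With Or Without You - U2 | Bad - U2                                       | 0.0251798629761 | 6    |
| With Or Without You - U2 | A Day Without Me - U2                          | 0.0237154364586 | 7    |
| With Or Without You - U2 | Another Time Another Place - U2 ...            | 0.0203251838684 | 8    |
| With Or Without You - U2 | Walk On - U2                                   | 0.0202020406723 | 9    |
| With Or Without You - U2 | Get On Your Boots - U2                         | 0.0196850299835 | 10   |
+--------------------------+------------------------------------------------+-----------------+------+
[10 rows x 4 columns]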


In [35]:
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])


Out[35]:
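+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| song                                           | similar                                             | score           | rank |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| Chan Chan (Live) - Buena Vista Social Club ... | Murmullo - Buena Vista Social Club ...              | 0.188118815422  | 1    |
| Chan Chan (Live) - Buena Vista Social Club ... | La Bayamesa - Buena Vista Social Club ...           | 0.18719214201   | 2    |
| Chan Chan (Live) - Buena Vista Social Club ... | Amor de Loca Juventud - Buena Vista Social Club ... | 0.184834122658  | 3    |
| Chan Chan (Live) - Buena Vista Social Club ... | Diferente - Gotan Project                           | 0.0214592218399 | 4    |
| Chan Chan (Live) - Buena Vista Social Club ... | Mistica - Orishas                                   | 0.0205761194229 | 5    |
| Chan Chan (Live) - Buena Vista Social Club ... | Hotel California - Gipsy Kings ...                  | 0.0193049907684 | 6    |
| Chan Chan (Live) - Buena Vista Social Club ... | Nací Orishas - Orishas                              | 0.0191571116447 | 7    |
| Chan Chan (Live) - Buena Vista Social Club ... | Gitana - Willie Colon                               | 0.018796980381  | 8    |
| Chan Chan (Live) - Buena Vista Social Club ... | Le Moulin - Yann Tiersen                            | 0.018796980381  | 9    |
| Chan Chan (Live) - Buena Vista Social Club ... | Criminal - Gotan Project                            | 0.0187793374062 | 10   |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+
[10 rows x 4 columns]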
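
Item-to-item lookups like the two above can be sketched with set overlap. This assumes Jaccard similarity over listener sets, which the source does not confirm; `similar_items` and the toy data are hypothetical.

```python
def similar_items(listeners, song, k=3):
    """Return the k songs whose listener sets overlap most with `song` (Jaccard)."""
    target = listeners[song]
    scored = []
    for other, users in listeners.items():
        if other == song:
            continue  # a song is not its own neighbour
        union = target | users
        score = len(target & users) / float(len(union)) if union else 0.0
        scored.append((other, score))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]

# song -> set of users who listened to it (illustrative values only)
listeners = {
    "song_a": {"u1", "u2", "u3"},
    "song_b": {"u1", "u2"},
    "song_c": {"u3"},
    "song_d": {"u4", "u5"},
}
neighbours = similar_items(listeners, "song_a")
```

Songs that share many listeners score highest, which is consistent with the U2 query above returning mostly other U2 songs.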

Quantitative comparison between the models

We now formally compare the popularity and the personalized models using precision-recall curves.


In [52]:
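# Compare (major, minor) as integers; slicing the version string would misorder versions such as "1.10".
if tuple(int(part) for part in graphlab.version.split('.')[:2]) >= (1, 6):
    model_performance = graphlab.compare(test_data, [popularity_model, personalized_model], user_sample=0.05)
    graphlab.show_comparison(model_performance, [popularity_model, personalized_model])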


compare_models: using 2931 users to estimate model performance
PROGRESS: Evaluate model M0
recommendations finished on 1000/2931 queries. users per second: 24533.3
recommendations finished on 2000/2931 queries. users per second: 28354.7
Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    | 0.0245649948823 | 0.00745871134817 |
|   2    | 0.0213237802798 | 0.0121873227042  |
|   3    | 0.0199021949278 | 0.0163322509024  |
|   4    | 0.0191914022518 | 0.0212117671176  |
|   5    | 0.0180143295803 | 0.0249458263788  |
|   6    | 0.0171158876379 | 0.0287861499735  |
|   7    | 0.0161329629088 |  0.031411615035  |
|   8    | 0.0155237120437 | 0.0349995194161  |
|   9    | 0.0148223966034 | 0.0376322647177  |
|   10   | 0.0143978164449 | 0.0406398100743  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1
recommendations finished on 1000/2931 queries. users per second: 24990.6
recommendations finished on 2000/2931 queries. users per second: 28238.2
Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.176731490959 | 0.0569939755457 |
|   2    |  0.155919481406 | 0.0966656022055 |
|   3    |  0.133629023087 |  0.120391609888 |
|   4    |  0.118730808598 |  0.139808482298 |
|   5    |  0.107540088707 |  0.15624266035  |
|   6    |  0.099567838053 |  0.174464172148 |
|   7    | 0.0928498318468 |  0.189266142516 |
|   8    | 0.0869157284203 |  0.201657364963 |
|   9    | 0.0821865878161 |  0.212629041416 |
|   10   | 0.0776526782668 |  0.222770921978 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]

Model compare metric: precision_recall

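The precision-recall curve shows that the personalized model performs much better: at a cutoff of 10, its mean precision is about 0.078 versus 0.014 for the popularity model, and its mean recall is about 0.22 versus 0.04.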
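
For reference, precision and recall at a cutoff k can be computed per user as sketched below and then averaged over users. This assumes the tables above use the standard definitions; `precision_recall_at_k` is a hypothetical helper, not a GraphLab function.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations against held-out items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(k)  # share of the k recommendations that were relevant
    recall = hits / float(len(relevant)) if relevant else 0.0  # share of relevant items found
    return precision, recall

recommended = ["a", "b", "c", "d", "e"]  # ranked model output for one user
relevant = ["b", "e", "z"]               # songs that user has in the test set
p_at_2, r_at_2 = precision_recall_at_k(recommended, relevant, 2)
p_at_5, r_at_5 = precision_recall_at_k(recommended, relevant, 5)
```

In this toy case precision drops from 0.5 to 0.4 while recall rises from one third to two thirds as k grows from 2 to 5, the same trade-off visible in both summary tables.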


In [39]:
for artist in ['Kanye West', 'Foo Fighters', 'Taylor Swift', 'Lady GaGa']:
    print artist, len(song_data[song_data['artist']==artist]['user_id'].unique())


Kanye West 2522
Foo Fighters 2055
Taylor Swift 3246
Lady GaGa 2928

In [41]:
total_listen_count_by_artist = song_data.groupby(key_columns=['artist'], operations={'total_count': graphlab.aggregate.SUM('listen_count')})

In [43]:
popularity = total_listen_count_by_artist.sort('total_count')

In [44]:
popularity[0]


Out[44]:
{'artist': 'William Tabbert', 'total_count': 14}

In [45]:
popularity[-1]


Out[45]:
{'artist': 'Kings Of Leon', 'total_count': 43218}

In [46]:
subset_test_users = test_data['user_id'].unique()[0:10000]

In [49]:
recommended_songs = personalized_model.recommend(subset_test_users, k=1).groupby(key_columns=['song'], operations={'count': graphlab.aggregate.COUNT()})


recommendations finished on 1000/10000 queries. users per second: 23780.1
recommendations finished on 2000/10000 queries. users per second: 30892.8
recommendations finished on 3000/10000 queries. users per second: 35602.8
recommendations finished on 4000/10000 queries. users per second: 36838.5
recommendations finished on 5000/10000 queries. users per second: 37146.6
recommendations finished on 6000/10000 queries. users per second: 37657
recommendations finished on 7000/10000 queries. users per second: 37836
recommendations finished on 8000/10000 queries. users per second: 38283.7
recommendations finished on 9000/10000 queries. users per second: 38535.2
recommendations finished on 10000/10000 queries. users per second: 36699.5

In [51]:
recommended_songs.sort('count')[-1]


Out[51]:
{'count': 435, 'song': 'Undo - Bj\xc3\xb6rk'}
